Parsing Early Modern English for Linguistic Search
This work addresses the question of whether the output of a state-of-the-art parser is accurate enough to support research in theoretical linguistics. In order to build reliable models of syntactic change, we aim to eventually parse the 1.5-billion-word Early English Books Online (EEBO) corpus. But since EEBO is not yet parsed, we begin by constructing and testing a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). In order to obtain robust results, we define an 8-fold split on PPCEME. We then evaluate the parser with evalb and, more relevantly for us, with a task-specific metric - namely, its accuracy in parsing 6 sentence types necessary to track the rise of auxiliary do (as in They did not come vs. its historical precursor They came not). Retrieving the relevant sentences from the gold and test versions with CorpusSearch queries, we find that the parser's accuracy promises to be sufficient for our purposes. A remaining concern is the variability of the output, which we plan to address with three pieces of future work sketched in the conclusion.
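The 8-fold evaluation protocol mentioned above could be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the document granularity, shuffling, and fold construction are assumptions, and the real evaluation also runs evalb and CorpusSearch queries over each fold's output.

```python
import random

def make_kfold_splits(documents, k=8, seed=0):
    """Partition a list of documents into k folds for train/test rotation.

    Each fold serves once as the test set while the remaining k-1 folds
    form the training set, so every document is scored exactly once.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        splits.append((train, test))
    return splits
```

Rotating folds in this way yields parser scores over the whole corpus rather than a single held-out slice, which is what makes the aggregate accuracy figures robust.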
Recommended from our members
Parsing Early English Books Online for Linguistic Search
This work addresses the question of how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for utility in linguistic research. Earlier work has trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of 6 specific sentence types of interest. However, significant differences between EEBO and the manually annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, an overlap of source material in PPCEME and EEBO allows us to establish a token alignment between them and to score the POS-tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, thus allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, pointing the way to further improvements.
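A token alignment of the kind described above can be sketched with a longest-matching-subsequence approach; the function names are illustrative, and the paper's actual alignment is more involved since the overlapping material differs in tokenization, spelling, and OCR quality.

```python
from difflib import SequenceMatcher

def align_tokens(eebo_tokens, ppceme_tokens):
    """Align two token sequences drawn from overlapping source material.

    Returns index pairs (i, j) for tokens matched one-to-one; stretches
    where the tokenizations diverge are simply skipped in this sketch.
    """
    matcher = SequenceMatcher(a=eebo_tokens, b=ppceme_tokens, autojunk=False)
    pairs = []
    for i, j, size in matcher.get_matching_blocks():
        pairs.extend((i + k, j + k) for k in range(size))
    return pairs

def pos_agreement(pairs, eebo_tags, gold_tags):
    """Score predicted POS tags on EEBO against gold PPCEME tags,
    restricted to the aligned token pairs."""
    if not pairs:
        return 0.0
    hits = sum(eebo_tags[i] == gold_tags[j] for i, j in pairs)
    return hits / len(pairs)
```

Restricting the score to aligned tokens is what makes the comparison fair: only positions where both corpora plainly contain the same word contribute to the estimated error rate.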
A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus
We describe the construction and evaluation of a part-of-speech tagger for
Yiddish (the first one, to the best of our knowledge). This is the first step
in a larger project of automatically assigning part-of-speech tags and
syntactic structure to Yiddish text for purposes of linguistic research. We
combine two resources for the current work - an 80K word subset of the Penn
Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million
words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word
embeddings on the YBC corpus, and these embeddings are used with a tagger model
trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has
many spelling inconsistencies, and we present some evidence that even simple
non-contextualized embeddings are able to capture the relationships among
spelling variants without the need to first "standardize" the corpus. We
evaluate the tagger performance on a 10-fold cross-validation split, with and
without the embeddings, showing that the embeddings improve tagger performance.
However, a great deal of work remains to be done, and we conclude by discussing
some next steps, including the need for additional annotated training and test
data.
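The claim that even non-contextualized embeddings capture relationships among spelling variants can be illustrated with a nearest-neighbor lookup over the embedding space. The helper names and the toy vectors below are illustrative stand-ins for embeddings trained on the YBC corpus, not the paper's actual setup.

```python
def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_variants(word, embeddings, top_n=3):
    """Rank vocabulary items by embedding similarity to `word`.

    If the embeddings capture distributional behavior, spelling variants
    of the same lexeme should surface near the top without any prior
    orthographic standardization.
    """
    sims = [(other, cosine(embeddings[word], vec))
            for other, vec in embeddings.items() if other != word]
    sims.sort(key=lambda pair: -pair[1])
    return [w for w, _ in sims[:top_n]]
```

A tagger can then consume these vectors directly as input features, which is how the with-embeddings condition in the 10-fold evaluation would use them.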
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further considers
the problem of distant multi-microphone conversational speech diarization and
recognition in everyday home environments. Speech material is the same as the
previous CHiME-5 recordings except for accurate array synchronization. The
material was elicited using a dinner party scenario with efforts taken to
capture data that is representative of natural conversational speech. This
paper provides a baseline description of the CHiME-6 challenge for both
segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.
Scale-Space Expansion of Acoustic Features Improves Speech Event Detection
In a system for detecting and measuring phonetic events (here bursts, voice onsets, and voice-onset times), we show that the addition of features smoothed at multiple scales can improve both recall (the proportion of events correctly identified) and measurement accuracy (the timing of events and the difference between event times, relative to expert human judgments). Multi-scale (or "scale space") features had an especially strong positive effect on robustness across datasets with different materials and recording conditions. Standard machine-learning classifiers were able to integrate information across scales, without any special treatment of the multiscale features.
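The multi-scale expansion described above can be sketched as smoothing a per-frame feature track at several window sizes and stacking the results into one vector per frame. This is a simplified illustration: the moving average stands in for the Gaussian smoothing typical of scale-space representations, and the window sizes are assumptions.

```python
def smooth(signal, window):
    """Moving-average smoothing with edge padding (a simple stand-in
    for Gaussian smoothing at one scale)."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]

def scale_space_features(signal, windows=(1, 3, 5)):
    """Expand a 1-D acoustic feature track into per-frame vectors that
    stack the track smoothed at each scale.

    A standard classifier can then weigh coarse and fine evidence
    jointly, with no special treatment of the multiscale features.
    """
    tracks = [list(signal) if w == 1 else smooth(signal, w) for w in windows]
    return [tuple(track[i] for track in tracks) for i in range(len(signal))]
```

Concatenating scales per frame, rather than picking one smoothing level, is what lets the classifier itself decide how much temporal context each event type needs.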